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Abstract 

We present a novel hybrid algorithm for Bayesian network structure learning, called H2PC. It first reconstructs the 
skeleton of a Bayesian network and then performs a Bayesian-scoring greedy hill-climbing search to orient the edges. 
The algorithm is based on divide-and-conquer constraint-based subroutines to learn the local structure around a target 
variable. We conduct two series of experimental comparisons of H2PC against Max-Min Hill-Climbing (MMHC), 
which is currently the most powerful state-of-the-art algorithm for Bayesian network structure learning. First, we use 
eight well-known Bayesian network benchmarks with various data sizes to assess the quality of the learned structure 
returned by the algorithms. Our extensive experiments show that H2PC outperforms MMHC in terms of goodness 
of fit to new data and quality of the network structure with respect to the true dependence structure of the data. 
Second, we investigate H2PC’s ability to solve the multi-label learning problem. We provide theoretical results to 
characterize and identify graphically the so-called minimal label powersets that appear as irreducible factors in the 
joint distribution under the faithfulness condition. The multi-label learning problem is then decomposed into a series 
of multi-class classification problems, where each multi-class variable encodes a label powerset. H2PC is shown 
to compare favorably to MMHC in terms of global classification accuracy over ten multi-label data sets covering 
different application domains. Overall, our experiments support the conclusions that local structural learning with 
H2PC in the form of local neighborhood induction is a theoretically well-motivated and empirically effective learning 
framework that is well suited to multi-label learning. The source code (in R) of H2PC as well as all data sets used for 
the empirical tests are publicly available. 
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1. Introduction 


A Bayesian network (BN) is a probabilistic model 
formed by a structure and parameters. The structure of a 
BN is a directed acyclic graph (DAG), whilst its param¬ 
eters are conditional probability distributions associated 
with the variables in the model. The problem of finding 
the DAG that encodes the conditional independencies 
present in the data attracted a great deal of interest over 
the last years (iRodrigues de M orals & Auss^, 2010a : 
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Ideally the DAG should coincide with the dependence 
structure of the global distribution, or it should at least 
identify a distribution as close as possible to the correct 
one in the probability space. This step, called structure 
learning, is similar in approaches and terminology to 
model selection procedures for classical statistical mod¬ 
els. Basically, constraint-based (CB) learning methods 
systematically check the data for conditional indepen¬ 
dence relationships and use them as constraints to con- 

























































































struct a partially oriented graph representative of a BN 
equivalence class, whilst search-and-score (SS) meth¬ 
ods make use of a goodness-of-fit score function for 
evaluating graphical structures with regard to the data 
set. Hybrid methods attempt to get the best of both 
worlds: they learn a skeleton with a CB approach and 
constrain on the DAGs considered during the SS phase. 

In this study, we present a novel hybrid algorithm 
for Bayesian network structure learning, called H2 PcQ. 
It first reconstructs the skeleton of a Bayesian net¬ 
work and then performs a Bayesian-scoring greedy hill¬ 
climbing search to orient the edges. The algorithm is 
based on divide-and-conquer constraint-based subrou¬ 
tines to learn the local structure around a target vari¬ 
able. HPC may be thought of as a way to compen¬ 
sate for the large number of false negatives at the output 
of the weak PC learner, by performing extra computa¬ 
tions. As this may arise at the expense of the number 
of false positives, we control the expected proportion of 
false discoveries (i.e. false positive nodes) among all 
the discoveries made in PC?-. We use a modification 
of the Incremental association Markov boundary algo¬ 
rithm (IAMB), initially develop ed by Tsamardinos et 
al. in ( Tsamard inos et all 20031) and later modified by 
Jose Pefia in dPenal 2008 ) to control the FDR of edges 
when learning Bayesian network models. HPC scales 
to thousands of variables and can deal with many fewer 
samples (n < q). To illustrate its performance by means 
of empirical evidence, we conduct two series of exper¬ 
imental comparisons of H2PC against Max-Min Hill- 
Climbing (MMHC), which is currently the most pow¬ 
erful state;oFthe 2 ar^_^orithm for BN structure learn¬ 
ing ( Tsamardinos et al. . 2006h . using well-known BN 
benchmarks with various data sizes, to assess the good¬ 
ness of fit to new data as well as the quality of the 
network structure with respect to the true dependence 
structure of the data. 

We then address a real application of H2PC where 
the true dependence structure is unknown. More specif¬ 
ically, we investigate H2PC’s ability to encode the joint 
distribution of the label set conditioned on the input 
features in the multi-label classification (MLC) prob¬ 
lem. Many challenging applications, such as photo 
and video annotation and web page categorization. 
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MLC focuses on the exploitation of the label condi¬ 
tional dependency in order to better predict the label 
combination for each example. We show that local BN 
structure discovery methods offer an elegant and pow¬ 
erful approach to solve this problem. We establish two 
theorems (Theorem 6 and 7) linking the concepts of 
marginal Markov boundaries, joint Markov boundaries 
and so-called label powersets under the faithfulness as¬ 
sumption. These Theorems offer a simple guideline to 
characterize graphically : i) the minimal label powerset 
decomposition, (i.e. into minimal subsets Yz,p c Y such 
that Y/,/) X Y \ Yip I X), and ii) the minimal subset of 
features, w.r.t an Information Theory criterion, needed 
to predict each label powerset, thereby reducing the in¬ 
put space and the computational burden of the multi¬ 
label classification. To solve the MLC problem with 
BNs, the DAG obtained from the data plays a pivotal 
role. So in this second series of experiments, we assess 
the comparative ability of H2PC and MMHC to encode 
the label dependency structure by means of an indirect 
goodness of fit indicator, namely the 0/1 loss function, 
which makes sense in the MLC context. 

The rest of the paper is organized as follows: In 
the Section 2, we review the theory of BN and dis¬ 
cuss the main BN structure learning strategies. We then 
present the H2PC algorithm in details in Section 3. Sec¬ 
tion 4 evaluates our proposed method and shows results 
for several tasks involving artificial data sampled from 
known BNs. Then we report, in Section 5, on our exper¬ 
iments on real-world data sets in a multi-label learning 
context so as to provide empirical support for the pro¬ 
posed methodology. The main theoretical results appear 
formally as two theorems (Theorem 8 and 9) in Section 
5. Their proofs are established in the Appendix. Fi¬ 
nally, Section 6 raises several issues for future work and 
we conclude in Section 7 with a summary of our contri¬ 
bution. 


2. Preliminaries 


We define next some key concepts used along the pa¬ 
per and state some results that will support our analysis. 
In this paper, upper-case letters in italics denote random 
variables (e.g., X, Y) and lower-case letters in italics de¬ 
note their values (e.g., x,y). Upper-case bold letters de¬ 
note random variable sets (e.g., X, Y, Z) and lower-case 
bold letters denote their values (e.g., x, y). We denote 
by X X Y I Z the conditional independence between X 
and Y given the set of variables Z. To keep the notation 
uncluttered, we use p{y \ x) to denote p(Y = y | X = x). 
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2.1. Bayesian networks 

Formally, a BN is a tuple < Q,P >, where Q -< 
U, E > is a directed acyclic graph (DAG) with nodes 
representing the random variables U and P a joint prob¬ 
ability distribution in 14. In addition, Q and P must 
satisfy the Markov condition; every variable, X e U, 
is independent of any subset of its non-descendant vari¬ 
ables conditioned on the set of its parents, denoted by 
Paf. From the M arkov condition, it is easy to prove 


(iNeapolitaru. 120041) that the joint probability distribution 
P on the variables in U can be factored as follows : 


pm = P(Xu ..., X„) = P] P(X, |Paf) 


( 1 ) 


Equation 1 allows a parsimonious decomposition of 
the joint distribution P. It enables us to reduce the prob¬ 
lem of determining a huge number of probability values 
to that of determining relatively few. 

A BN structure Q entails a set of conditional indepen¬ 
dence assumptions. They can all be identified by the d- 
separation criterion ( Pearl . 19881) . We use X ±g YjZ to 
denote the assertion that X is d-separated from Y given 
Z in ff. Formally, X ±g T|Z is true when for every undi¬ 
rected path in ^ between X and Y, there exists a node 
W in the path such that either (1) VF does not have two 
parents in the path and IF e Z, or (2) VF has two parents 
in the path and neither W nor its descendants is in Z. If 
< ^, P > is a BN, X ±p Y\Z if X ±g Y\Z. The converse 
does not necessarily hold. We say that < Q,P > satis¬ 
fies the faithfulness condition if the d-separations in 
identify all and only the conditional independencies in 
P, i.e., X ±p Y\Z iff X ±g Y\Z. 

A Markov blanket of T is any set of variables 
such that T is conditionally independent of all the re¬ 
maining variables given Mj-. By extension, a Markov 
blanket of T in V guarantees that Mj- c V, and that T 
is conditionally independent of the remaining variables 
in V, given Mj-. A Markov boundary, MB7-, of T is any 
Markov blanket such that none of its proper subsets is a 
Markov blanket of T. 

We denote by PC^, the set of parents and children 
of T in Q, and by SP^, the set of spouses of T in Q. 
The spouses of T are the variables that have common 
children with T. These sets are unique for all 0, such 
that < Q,P > satisfies the faithfulness condition and so 
we will drop the superscript Q. We denote by dSepjX), 
the set that d-separates X from the (implicit) target T. 

Theorem 1. Suppose < @,P > satisfies the faithfulness 
condition. Then X and Y are not adjacent in Q iff3Z e 
U \ {X, Y} such that X JL Y \ Z . Moreover, MB^ = 

PCx U SPx ■ 


A proof can be found for instance in (INeapolitan , 


12004. 


Two graphs are said equivalent iff they encode the 
same set of conditional independencies via the d- 
separation criterion. The equivalence class of a DAG 
^ is a set of DAG s that are eq uivalent to Q. The next 
result showed by ( Pearll 1988h . establishes that equiv¬ 
alent graphs have the same undirected graph but might 
disagree on the direction of some of the arcs. 


Theorem 2. Two DAGs are equivalent iff they have the 
same underlying undirected graph and the same set of 
v-structures (i.e. converging edges into the same node, 
such as X ^ Y Z). 


Moreover, an equivalence class of network structures 
can be uniquely represented by a partially directed DAG 
(PDAG), also called a DAG pattern. The DAG pattern is 
defined as the graph that has the same links as the DAGs 
in the equivalence class and has oriented all and only the 
edges common to the DAGs in the equivalence class. 
A structure learning algorithm from data is said to be 
correct (or sound) if it returns the correct DAG pattern 
(or a DAG in the correct equivalence class) under the 
assumptions that the independence tests are reliable and 
that the learning database is a sample from a distribution 
P faithful to a DAG Q, The (ideal) assumption that the 
independence tests are reliable means that they decide 
(in)dependence iff the (in)dependence holds in P. 


2.2. Conditional independence properties 


T he following three theor ems, borr owed 
from Pena et al. ( 2007 ), are proven in Pearll ( 1988 ): 


Theorem 3. Let X, Y, Z and W denote four mutu¬ 
ally disjoint subsets of U. Any probability distribu¬ 
tion p satisfies the following four properties: Symme¬ 
try X. JL Y \ Z ^ Y XX|Z, Decomposition, 
X X (Y U W) I Z ^ X X Y I Z, Weak Union 
X X (Y U W) I Z X X Y I (Z U W) and Contraction, 
XxY|(ZuW)aXxW|Z^Xx(YuW)|Z. 


Theorem 4. If p is strictly positive, then p satisfies the 
previous four properties plus the Intersection property 
XxY|(ZuW)aXxW|(ZuY)^Xx(YU 
W) I Z. Additionally, each X e U has a unique Markov 
boundary, MBx 

Theorem 5. If p is faithful to a DAG Q, then p satisfies 
the previous five properties plus the Composition prop- 
erfy X X Y I Z A X X W I Z ^ X X (Y U W) I Z ant/ 
the local Markov property X X (NDx \ Pax) I Pax /or 
each X e U, where NDx denotes the non-descendants 
of X in 0. 
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2.3. Structure learning strategies 

The number of DAGs, G, is super-exponential in the 
number of random variables in the domain and the prob¬ 
lem of learning the most prob able a posteriori BN from 


in terms of ti me efficiency and q uality of reconstruc- 


data is worst-case NP-hard (IChickering et all 1200411 . 


One needs to resort to heuristical methods in order to 
be able to solve very large problems effectively. 

Both CB and SS heuristic approaches have advan¬ 
tages and disadvantages. CB approaches are relatively 
quick, deterministic, and have a well defined stop¬ 
ping criterion; however, they rely on an arbitrary sig¬ 
nificance level to test for independence, and they can 
be unstable in the sense that an error early on in the 
search can have a cascading effect that causes many er¬ 
rors to be present in the final graph. SS approaches 
have the advantage of being able to flexibly incor¬ 
porate users’ background knowledge in the form of 
prior probabilities over the structures and are also capa¬ 
ble of dealing with incomplete records in the database 
(e.g. EM technique). Although SS methods are fa¬ 
vored in practice when dealing with small dimensional 
data sets, they are slow to converge and the computa¬ 
tional complexity of ten prevents us from finding op 


timal BN structures (iPerrier et al.L l2008t iKoiima et al. 


algorithms 


(Koivisto & Soodl 

2004: 

Silander & Mvllvmaki, 

2006; 

Cussens & Bartlett 

, 2013 

: Studenv & Haws, 2014t) and 


a decomposable score like BDeu, the computational 
complexity remains exponential, and therefore, such al¬ 
gorithms are intractable for BNs with more than around 
30 vertices on current workstations. For larger sets of 
variables the computational burden becomes prohibitive 
and restrictions about the structure have to be imposed, 
such as a limit on the size of the parent sets. With 
this in mind, the ability to restrict the search locally 
around the target variable is a key advantage of CB 
methods over SS methods. They are able to construct 
a local graph around the target node without having 


itv (IPena et al.. 

20071 iRodri 

gues de Morals & Aussem, 

2010blallTsamardinos et al. 

20061 Penal 12008). 


With a view to balancing the computation cost with 
the desired accuracy of the estimates, several hybrid 
methods have bee n proposed recently. Ts amardinos 
et al. proposed in ( Tsamardinos et al. . 2006h the Min- 
Max Hill Climbing (MMHC) algorithm and conducted 
one of the most extensive empirical comparison per¬ 
formed in recent years showing that MMHC was the 
fastest and the most accurate method in terms of 
structural error based on the structural hamming dis¬ 
tance. More specifically, MMHC outperformed both 


tion the PC ( Spirtes et al., 200^ . the Sparse Candi¬ 


date (iFriedma net al.Lll999b) , the T hree Phase Depen- 


dencv A nalysis Jc heng et ak. 2002 ). the Optimal Rein¬ 


sertion ( Moore & Wond. 20031). the Greedy Equiva¬ 
lence Search (IChickerina 1200211 . and the Greedy Hill- 
Climbing Search on a variety of networks, sample 
sizes, and parameter values. Although MMHC is rather 
heuristic by nature (it returns a local optimum of the 
score function), it is currently considered as the most 
powerful state-of-the-art algorithm for BN structure 
learning capable of dealing with thousands of nodes in 
reasonable time. 

In order to enhance its performance on small dimen 


siona l data sets, Perrier et al. proposed in (IPerrier et al 


20081) a hybrid algorithm that can learn an optimal BN 
(i.e., it converges to the true model in the sample limit) 
when an undirected graph is given as a structural con¬ 
straint. They defined this undirected graph as a super¬ 
structure (i.e., every DAG considered in the SS phase is 
compelled to be a subgraph of the super-structure). This 
algorithm can learn optimal BNs containing up to 50 
vertices when the average degree of the super-structure 
is around two, that is, a sparse structural constraint is 
assumed. To extend its feasibility to BN with a few hun¬ 
dred of vertices and an average degree up to four, Ko- 
jima et al. proposed in ( Koiima et al. . 2010h to divide 
the super-structure into several clusters and perform an 
optimal search on each of them in order to scale up to 
larger networks. Despite interesting improvements in 
terms of score and structural hamming distance on sev¬ 
eral benchmark BNs, they report running times about 
10^ times longer than MMHC on average, which is still 
prohibitive. 

Therefore, there is great deal of interest in hy¬ 
brid methods capable of improving the structural ac¬ 
curacy of both CB and SS methods on graphs con¬ 
taining up to thousands of vertices. However, they 
make the strong assumption that the skeleton (also 
called super-structure) contains at least the edges of 
the true network and as small as possible extra edges. 
While controlling the false discovery rate (i.e., false 
extra edges) in BN learning has attrac t ed some at¬ 
tention recently (Armen & Tsamardinosl 2011 : Penal 
2008 : Tsamardinos & Brownl 2008 ). to our knowledge, 
there is no work on controlling actively the rate of false¬ 
negative errors (i.e., false missing edges). 

2.4. Constraint-based structure learning 

Before introducing the H2PC algorithm, we discuss 
the general idea behind CB methods. The induction of 
local or global BN structures is handled by CB methods 
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through the identification of local neighborhoods (i.e., 
PCx), hence their scalability to very high dimensional 
data sets. CB methods systematically check the data for 
conditional independence relationships in order to in¬ 
fer a target’s neighborhood. Typically, the algorithms 
run either a or a independence test when the data 
set is discrete and a Fisher’s Z test when it is continu¬ 
ous in order to decide on dependence or independence, 
that is, upon the rejection or acceptance of the null hy¬ 
pothesis of conditional independence. Since we are 
limiting ourselves to discrete data, both the global and 
the local distributions are assumed to be multinomial, 
and the latter are represented as conditional probabil¬ 
ity tables. Conditional independence tests and network 
scores for discrete data are functions of these condi¬ 
tional probability tables through the observed frequen¬ 
cies {tiijic', i - 1,.. .,R', j - 1,..., C; k = 1,..., L} for 
the random variables X and Y and all the configurations 
of the levels of the conditioning variables Z. We use 
rij+ic as shorthand for the marginal 2 j riijk and similarly 
for rii+k, n++k and n+++ = n. We use a classic conditional 
independence test based on the mutual information. The 
mutual information is an information-theoretic distance 
measure defined as 




/=1 y=i k=\ 


^ijk ^ ^ijk^++k 
^i+k^-Yjk 


It is proportional to the log-likelihood ratio test 
(they differ by a 2n factor, where n is the sample 
size). The asymptotic null distribution is with 
{R - 1)(C - l)L degrees of freedom. For a detailed 
analysis of the ir properties we refer the reader to 
(lAgrestiL 120021) . The main limitation of this test is 
the rate of convergence to its limiting distribution, 
which is particularly problematic when dealing with 
small samples and sparse contingency tables. The 
decision of accepting or rejecting the null hypothe¬ 
sis depends implicitly upon the degree of freedom 
which increases exponentially with the number of 
variables in the conditional set. Several heuristic solu - 
tions hav e emerged in the literature (Snirtes et al. 


200C ; _ Rodrigues de Morais & A ussemi 2010a ; 


Tsamardinos et alfl2006[]Tsamardinos & Borboudakis . 


2010l) to overcome some shortcomings of the asymp 
totic tests. In this study we use the two following 
heuristics that are used in MMHC. First, we do not 
perform MI{X, Y\Z) and assume independence if 
there are not enough samples to achieve large enough 
power. We require that the average sample per count 
is above a user defined parameter, equal to 5, as in 


( Tsamardinos et al. . 2006h . This heuristic is called the 


power rule. Second, we consider as structural zero 
either case n+jk or tij+k - 0. For example, if n+jk - 0, 
we consider y as a structurally forbidden value for Y 
when Z = z and we reduce R by 1 (as if we had one 
column less in the contingency table where Z = z). 
This is known as the degrees of freedom adjustment 
heuristic. 


3. The H2PC algorithm 

In this section, we present our hybrid algorithm 
for Bayesian network structure learning, called Hy¬ 
brid HPC (H2PC). It first reconstructs the skeleton of 
a Bayesian network and then performs a Bayesian- 
scoring greedy hill-climbing search to filter and orient 
the edges. It is based on a CB subroutine called HPC to 
learn the parents and children of a single variable. So, 
we shall discuss HPC first and then move to H2PC. 


3.1. Parents and Children Discovery 

HPC (Algorithm [T]i can be viewed as an ensemble 
method for combining many weak PC learners in an at¬ 
tempt to produce a stronger PC learner. HPC is based 
on three subroutines; Data-Efficient Parents and Chil¬ 
dren Superset (DE-PCS), Data-Efficient Spouses Super¬ 
set (DE-SPS), and Incremental Association Parents and 
Children with false discovery rate control ( EDR- I APC) , 
a weak PC learner based on EDR-IAMB ( Pena . 2008 ) 
that requires little computation. HPC receives a target 
node T, a data set D and a set of variables U as input 
and returns an estimation of PC?-. It is hybrid in that 
it combines the benefits of incremental and divide-and- 
conquer methods. The procedure starts by extracting 
a superset PCSj of PCj (line 1) and a superset SPSj- 
of SPr (line 2) with a severe restriction on the max¬ 
imum conditioning size (Z <= 2) in order to signifi¬ 
cantly increase the reliability of the tests. A first can¬ 
didate PC set is then obtained by running the weak PC 
learner called EDR-IAPC on PCSr USPSr (line 3). The 
key idea is the decentralized search at lines 4-8 that in¬ 
cludes, in the candidate PC set, all variables in the su¬ 
perset PCSj- \ PCj that have T in their vicinity. So, 
HPC may be thought of as a way to compensate for the 
large number of false negatives at the output of the weak 
PC learner, by performing extra computations. Note 
that, in theory, X is in the output of PDR-IAPC(T) if 
and only if Y is in the output of PDR-IAPC(A). How¬ 
ever, in practice, this may not always be true, p articu¬ 
larly w hen working in high-dimensional domains (iPena , 


2008h . By loosening the criteria by which two nodes are 
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said adjacent, the effective restrictions on the size of the 
neighborhood are now far less severe. The decentral¬ 
ized search has a significant impact on the accuracy of 
HPC as we shall see in in the experiments . We proved in 
( iRodrigues de Morais & Aussem , 2010a ) that the orig¬ 
inal HPC(T) is consistent, i.e. its output converges in 
probability to PCj- , if the hypothesis tests are consis¬ 
tent. The proof also applies to the modified version pre¬ 
sented here. 

We now discuss the subroutines in more detail. FDR- 
lAPC (Algorithmic is a fast incremental method that 
receives a data set O and a target node T as its in¬ 
put and promptly returns a rough estimation of PCj, 
hence the term “weak” PC learner. In this study, we 
use FDR-IAPC because it aims at controlling the ex¬ 
pected proportion of false discoveries (i.e., false pos¬ 
itive nodes in PC?-) among all the discoveries made. 
FDR-IAPC is a straightforward extension of the algo¬ 
rithm lAMBFDR developed by Jose Pena in ( Penal 
20081) . which is itself a modification of the incremen- 
tal association Markov b oundary algorithm (IAMB) 
dTsamardinos et all l2003h . to control the expected pro¬ 
portion of false discoveries (i.e., false positive nodes) in 
the estimated Markov boundary. FDR-IAPC simply re¬ 
moves, at lines 3-6, the spouses SPj from the estimated 
Markov boundary MBj output by IAMB FDR at line 1, 
and returns PCj- assuming the faithfulness condition. 

The subroutines DE-PCS (Algorithmic and DE-SPS 
(AlgorithmlCl search a superset of PCr and SPr respec¬ 
tively with a severe restriction on the maximum condi¬ 
tioning size (|Z| <= 1 in DE-PCS and |Z| <= 2 in DE- 
SPS) in order to significantly increase the reliability of 
the tests. The variable filtering has two advantages : 
i) it allows HPC to scale to hundreds of thousands of 
variables by restricting the search to a subset of relevant 
variables, and ii) it eliminates many true or approximate 
deterministic relationships that produce many false neg- 
ati ve errors in the output of the algorithm, a s explained 
in dRodrigues de Morais & Aussem , 2010bllal) . DE-SPS 
works in two steps. Eirst, a growing phase (lines 4-8) 
adds the variables that are d-separated from the target 
but still remain associated with the target when condi¬ 
tioned on another variable from PCSj. The shrinking 
phase (lines 9-16) discards irrelevant variables that are 
ancestors or descendants of a target’s spouse. Pruning 
such irrelevant variables speeds up HPC. 


3.2. Hybrid HPC (H2PC) 

In this section, we discuss the SS phase. The follow¬ 


ing d iscussion draws strongly on (iTsamardinos et al. 


20061) as the SS phase in Hybrid HPC and MMHC are 
exactly the same. The idea of constraining the search 


Algorithm 1 HPC 

Require: T: target; D: data set; U: set of variables 
Ensure: PCr: Parents and Children of T 

1: [PCSr.dSep] ^D£-PCS(r,D,U) 

2: SPSr ^ DE-SPSiT, 2), U, PCSr, dSep) 

3: PCr <- FDR-IAPC{T,D, (T U PCSr U SPSr)) 

4: for all X 6 PCSr \ PCr do 

5: if r € FDR-IAPC(X, D, (T U PCSr U SPSr)) then 

6: PCr ^ PCr U A 

7: end if 

8: end for 


Algorithm 2 FDR-IAPC 

Require: T ; target; D\ data set; U: set of variables 

Ensure: PCr: Parents and children of T ; 

* Learn the Markov boundary of T 
1: MSt ^ IAMBFDR{X,D,\i) 

* Remove spouses ofT from MBr 
2: PCr ^— IVIBr 

3: for all A 6 MBr do 

4: if 3Z c (MBr \ X) such that T _L A | Z then 

5: PCr ^ PCr \ A 

6: end if 

7: end for 


Algorithm 3 DE-PCS 

Require: T : target; D: data set; U: set of variables; 

Ensure: PCSr: parents and children superset of T ; dSep: d- 
separating sets; 

Phase I: Remove X if T _L A 
1: PCSr <-U\r 
2: for all A 6 PCSr do 
3: if (T J_ A) then 

4: PCSr ^ PCSr \ A 

5: dSep(A) ^ 9 

6: end if 

7: end for 

Phase II: Remove X if T ±X\Y 

8: for all A 6 PCSr do 
9: for all Y 6 PCSr \ A do 

10: if (T _L A I T) then 

11: PCSr PCSr \ A 

12: dSep(A) ^ Y 

13: break loop EOR 

14: end if 

15: end for 

16: end for 
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Algorithm 4 DE-SPS 

Require: T: target; D: data set; U: the set of variables; 
PCS^: parents and children superset of T\ dSep: d- 
separating sets; 

Ensure: SPSj: Superset of the spouses of T ; 

1: SPSr <-0 

2: for all X e PCSr do 

3: SPS^ ^ 0 

4: foraliy £U\{ruPCSr|do 

5: if (T I y|dSep(y) U X) then 

6: SPS^ ^ SPSf U Y 

7: end if 

8: end for 

9: for all Y € SPS^ do 

10: for all Z e SPS^ \ Y do 

11: if (r _L y|A U Z) then 

12: SPS^ SPS^ \ Y 

13: break loop FOR 

14: end if 

15: end for 

16: end for 

17: SPSt «- SPSt U SPS^ 

18: end for 


to improve time-effic iency first appeared in the Sparse 
Candidate algorithm dpriedman et al. . 1999b ). It results 
in efficiency improvements over the (unconstrained) 
greedy search. All recent hybrid algorithms build on 
this idea, but employ a sound algorithm for identifying 
the candidate parent sets. The Hybrid HPC first identi¬ 
fies the parents and children set of each variable, then 
performs a greedy hill-climbing search in the space of 
BN. The search begins with an empty graph. The edge 
addition, deletion, or direction reversal that leads to the 
largest increase in score (the BDeu score with uniform 
prior was used) is taken and the search continues in a 
similar fashion recursively. The important difference 
from standard greedy search is that the search is con¬ 
strained to only consider adding an edge if it was dis¬ 
covered by HPC in the firs t phase. We extend the g reedy 


search with a TABU list (IFriedman et al.L Il999bl) . The 


list keeps the last 100 structures explored. Instead of ap¬ 
plying the best local change, the best local change that 
results in a structure not on the list is performed in an at¬ 
tempt to escape local maxima. When 15 changes occur 
without an increase in the maximum score ever encoun¬ 
tered during search, the algorithm terminates. The over¬ 
all best scoring structure is then returned. Clearly, the 
more false positives the heuristic allows to enter candi¬ 
date PC set, the more computational burden is imposed 
in the SS phase. 


4. Experimental validation on synthetic data 


Algorithm 5 Hybrid HPC 
Require: D: data set; U: set of variables 
Ensure: A DAG on the variables U 

1: for all pair of nodes A, T e U do 

2: Add X in PCy and Add Y in PCx if A e HPC(Y) and 

Y 6 HPC(X) 

3: end for 

4: Starting from an empty graph, perform greedy hill¬ 
climbing with operators add-edge, delete-edge, reverse- 
edge. Only try operator add-edge A —> T if K € PCx 


Before we proceed to the experiments on real-world 
multi-label data with H2PC, we first conduct an exper¬ 
imental comparison of H2PC against MMHC on syn¬ 
thetic data sets sampled from eight well-known bench¬ 
marks with various data sizes in order to gauge the prac¬ 
tical relevance of the H2PC. These BNs that have been 
previously used as benchmarks for BN learning algo¬ 
rithms (see Table[T]for details). Results are reported in 
terms of various performance indicators to investigate 
how well the network generalizes to new data and how 
well the learned dependence structure matches the true 
structure o f the benchmark netw ork. We implemented 
H2PC in R ( R Core Teatnl 2013) and integrate d the code 
into the bnlearn package from ( Scutari . 2010h . MMHC 
was implemented by Marco Scutari in bnlearn. The 
source code of H2PC as well as all data sets used for 
the empirical tests are publicly available H. The thresh¬ 
old considered for the type I error of the test is 0.05. 
Our experiments were carried out on PC with Intel(R) 


■ https://github.com/madbix/bnlearn-clone-3.4 


7 


























Table 1: Description of the BN benchmarks used in the experiments. 


network 

# var. 

# edges 

max. degree 
in/out 

domain 

range 

min/med/max 
PC set size 

child 

20 

25 

2/7 

2-6 

1/2/8 

insurance 

27 

52 

3/7 

2-5 

1/3/9 

mildew 

35 

46 

3/3 

3-100 

1/2/5 

alarm 

37 

46 

4/5 

2-4 

1/2/6 

hailfinder 

56 

66 

4/16 

2-11 

1/1.5/17 

muninl 

186 

273 

3/15 

2-21 

1/3/15 

pigs 

441 

592 

2/39 

3-3 

1/2/41 

link 

724 

1125 

3/14 

2-4 

0/2/17 


Core(TM) i5-3470M CPU @3.20 GHz 4Go RAM run- 
ning under Linux 64 bits. 

We do not claim that those data sets resemble real- 
world problems, however, they make it possible to com¬ 
pare the outputs of the algorithms with the known struc¬ 
ture. All BN benchmarks (structure and probability ta¬ 
bles) were downloaded from the bnlearn repositor}f|. 
Ten sample sizes have been considered: 50, 100, 200, 
500, 1000, 2000, 5000, 10000, 20000 and 50000. All 
experiments are repeated 10 times for each sample size 
and each BN. We investigate the behavior of both algo¬ 
rithms using the same parametric tests as a reference. 


4.1. Performance indicators 

We first investigate the quality of the skeleton re¬ 
turned by H2PC during the CB phase. To this end, we 
measure the false positive edge ratio, the precision (i.e., 
the number of true positive edges in the output divided 
by the number of edges in the output), the recall (i.e., the 
number of true positive edges divided the true number 
of edges) and a combination of precision and recall de¬ 
fined as -^(1 - precision)^ -H (1 - recall)^, to measure 
the Euclidean distan ce from perfect p recision and re¬ 
call, as proposed in (IPena et all l2007h . Second, to as¬ 
sess the quality of the final DAG output at the end of 
the SS phase, we report th e five performance indicators 
dScutari & Broginii 120121) described below: 


the posterior density of the network for the data 
it was learned from, as a measure of goodness of 
fit. It is known as the B ayesian Dirichlet equiva 


lent score (BD eu) from (iHeckerman et al.L 11995 


Buntine , 199lh and has a single parameter, the 


equivalent sample size, which can be thought of 
as the size of an imaginary sample supporting 
the prior distribution. The eq uivalent sample size 
was s et to 10 as suggested in (IKoller & Friedman . 
2009b : 


^ http:l/www. bnlearn. com/bnrepository 


• the BIC score (ISchwarzL Il978b of the network for 
the data it was learned from, again as a measure of 
goodness of fit; 


• the posterior density of the network for a new data 
set, as a measure of how well the network general¬ 
izes to new data; 


• the BIC score of the network for a new data set, 
again as a measure of how well the network gener¬ 
alizes to new data; 


• the Structural Hamming Distance (SHD) between 
the learned and the true structure of the network, 
as a measure of the quality of the learned depen¬ 
dence structure. The SHD between two PDAGs is 
defined as the number of the following operators 
required to make the PDAGs match: add or delete 
an undirected edge, and add, remove, or reverse the 
orientation of an edge. 


For each data set sampled from the true probability 
distribution of the benchmark, we first learn a network 
structure with the H2PC and MMHC and then we com¬ 
pute the relevant performance indicators for each pair 
of network structures. The data set used to assess how 
well the network generalizes to new data is generated 
again from the true probability structure of the bench¬ 
mark networks and contains 50000 observations. 

Notice that using the BDeu score as a metric of recon¬ 
struction quality has the following two problems. First, 
the score corresponds to the a posteriori probability of a 
network only under certain conditions (e.g., a Dirichlet 
distribution of the hyper parameters); it is unknown to 
what degree these assumptions hold in distributions en¬ 
countered in practice. Second, the score is highly sensi¬ 
tive to the equivalent sample size (set to 10 in our exper¬ 
iments) and depends on the network priors used. Since, 
typically, the same arbitrary value of this parameter is 
used both during learning and for scoring the learned 
network, the metric favors algorithms that use the BDeu 
score for learning. In fact, the BDeu score does not rely 
on the structure of the original, gold standard network at 
all; instead it employs several assumptions to score the 
networks. For those reasons, in addition to the score we 
also report the BIC score and the SHD metric. 


4.2. Results 

In Figure [1] we report the quality of the skeleton ob¬ 
tained with HPC over that obtained with MMPC (be¬ 
fore the SS phase) as a function of the sample size. 
Results for each benchmark are not shown here in de¬ 
tail due to space restrictions. For sake of conciseness. 
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False negative rate 
(lower is better) 



Recall 

(higher is better) 



Euclidian distance 
(lower is better) 
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(lower is better) 



Precision 
(higher is better) 



Number of statistical tests 



Figure 1: Quality of the skeleton obtained with HPC over that obtained with MMPC (after the CB phase). The 2 first figures present mean values 
aggregated over the 8 benchmarks. The 4 last figures present increase factors of HPC / MMPC, with the median, quartile, and most extreme values 
(green boxplots), along with the mean value (black line). 
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the performance values are averaged over the 8 bench¬ 
marks depicted in Table [T] The increase factor for a 
given performance indicator is expressed as the ratio of 
the performance value obtained with HPC over that ob¬ 
tained with MMPC (the gold standard). Note that for 
some indicators, an increase is actually not an improve¬ 
ment but is worse (e.g., false positive rate. Euclidean 
distance). For clarity, we mention explicitly on the sub¬ 
plots whether an increase factor > 1 should be inter¬ 
preted as an improvement or not. Regarding the qual¬ 
ity of the superstructure, the advantages of HPC against 
MMPC are noticeable. As observed, HPC consistently 
increases the recall and reduces the rate of false negative 
edges. As expected this benefit comes at a little expense 
in terms of false positive edges. HPC also improves 
the Euclidean distance from perfect precision and re¬ 
call on all benchmarks, while increasing the number of 
independence tests and thus the running time in the CB 
phase (see number of statistical tests). It is worth noting 
that HPC is capable of reducing by 50% the Euclidean 
distance with 50000 samples (lower left plot). These 
results are ve ry much in line with other experiments 
presented in d Rodrigues d e Morais & Aussem , 2010a : 
Villanueva & Maciel[l2012lf 


In Figure|2] we report the quality of the final DAG ob¬ 
tained with H2PC over that obtained with MMHC (af¬ 
ter the SS phase) as a function of the sample size. Re¬ 
garding BDeu and BIC on both training and test data, 
the improvements are noteworthy. The results in terms 
of goodness of fit to training data and new data using 
H2PC clearly dominate those obtained using MMHC, 
whatever the sample size considered, hence its ability to 
generalize better. Regarding the quality of the network 
structure itself (i.e., how close is the DAG to the true 
dependence structure of the data), this is pretty much 
a dead heat between the 2 algorithms on small sam¬ 
ple sizes (i.e., 50 and 100), however we found H2PC 
to perform significantly better on larger sample sizes. 
The SHD increase factor decays rapidly (lower is bet¬ 
ter) as the sample size increases (lower left plot). For 
50 000 samples, the SHD is on average only 50% that of 
MMHC. Regarding the computational burden involved, 
we may observe from Table |2] that H2PC has a little 
computational overhead compared to MMHC. The run¬ 
ning time increase factor grows somewhat linearly with 
the sample size. With 50000 samples, H2PC is approxi¬ 
mately 10 times slower on average than MMHC. This is 
mainly due to the computational expense incurred in ob¬ 
taining larger PC sets with HPC, compared to MMPC. 


5. Application to multi-label learning 


In this section, we address the problem of multi¬ 
label learning with H2PC. MLC is a challenging 
problem in many real-world application domains, 
where each instance can be assigned simultaneously 
to multiple binary labels ( Dembcz^AskietakL 2012; 
Read etal.. 2009: Madi^ov et al. T2012t Kocev et al. . 
20071 Tsoumakas et al.l 2010bh . Formally, learning 
from multi-label examples amounts to finding a map¬ 
ping from a space of features to a space of labels. We 
shall assume throughout that X (a random vector in R'^) 
is the feature set, Y (a random vector in {0,1J") is the 
label set, U = X U Y and p a probability distribution 
defined over U. Given a multi-label training set £), the 
goal of multi-label learning is to find a function which 
is able to map any unseen example to its proper set of 
labels. From the Bayesian point of view, this problem 
amounts to modeling the conditional joint distribution 
P(Y I X). 


5.1. Related work 


This MFC p roblem may be tackled in vario us ways 
(Fuaceset a .. 2()T5 Alvares-Cherman et al.. 2011 


Read et alT 20091" Blockeel et al.L 19981: iKocey et al 


20071). Each of these approaches is supposed to cap¬ 
ture - to some extent - the relationships between la¬ 
bels. The two most straightforward meta-learning 


methods (Madjarov et al.l 

201 

2) are: Binary Rele- 

vance (BR) (IFuaces et al.L 

201 

2) and Fabel Powerset 

(FP) (iTsoumakas & Vlahavas, 

ZOOTI Tsoumakas et al., 

2010b|). Both methods can be regarded as opposite in 


the sense that BR does consider each label indepen¬ 
dently, while FP considers the whole label set at once 
(one multi-class problem). An important question re¬ 
mains: what shall we capture from the statistical re¬ 
lationships between labels exactly to solve the multi¬ 
label classification proble m? The problem a tt racted 


a great deal of inter est (iDembczvAski et al.L 120121 : 


Zhang & Zhang . 2010l) . It is well beyond the scope 


and purpose of this paper to delve deeper into these ap 


proaches, we point the r eader to (IDembczvAski et al 


2012 : Tsoumakas et al. . 2010bh for a review. The 
second fundamental problem that we wish to address 
involves finding an optimal feature subset selection 
of a label set, w.r.t an Information Theory criterion 
Roller & Sahamil ( 19961) . As in the single-label case, 
multi-label feature selection has been studied recently 


and has encountere d some success (iGu et al.l 1201 1 
Snolaor et al. . 2013 ). 


10 















































































































log(BDeu post.) on train data 
(lower is better) 



log(BDeu post.) on test data 
(lower is better) 



BIC on train data 
(lower is better) 



BIC on test data 
(lower is better) 



OOOOOOOOOO 
LOOOOOOOOOO 
t-CNJLOOOOOOO 
•I- CM uo o o o 
1- CM un 

sample size 


Structural Hamming Distance 
(lower is better) 


Number of scores 




Figure 2: Quality of the final DAG obtained with H2PC over that obtained with MMHC (after the SS phase). The 6 figures present increase factors 
of HPC / MMPC, with the median, quartile, and most extreme values (green boxplots), along with the mean value (black line). 
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Table 2: Total running time ratio R (H2PC/MMHC). White cells indicate a ratio R < 1 (in favor of H2PC), while shaded cells indicate a ratio R > 1 
(in favor of MMHC). The darker, the larger the ratio. 


,, , Sample Size 

Network _i_ 



50 

100 

200 

500 

1000 

2000 

5000 

10000 

20000 

50000 












child 

0.94 

0.87 

1.14 

1.99 

2.26 

2.12 

2.36 

2.58 

1.75 

1.78 

±0.1 

±0.1 

±0.1 

±0.2 

±0.1 

±0.2 

±0.4 

±0.3 

±0.6 

±0.5 

insurance 

0.96 

1.09 

1.56 

2.93 

3.06 

3.48 

3.69 

4.10 

3.76 

3.75 

±0.1 

±0.1 

±0.1 

±0.2 

±0.3 

±0.4 

±0.3 

±0.4 

±0.6 

±0.5 

mildew 

0.77 

0.80 

0.79 

0.94 

1.01 

1.23 

1.74 

2.14 

3.26 

6.20 

±0.1 

±0.1 

±0.1 

±0.1 

±0.1 

±0.1 

±0.2 

±0.2 

±0.6 

±1.0 

alarm 

0.88 

1.11 

1.75 

2.43 

2.55 

2.71 

2.65 

2.80 

2.49 

2.18 

±0.1 

±0.1 

±0.1 

±0.1 

±0.1 

±0.1 

±0.2 

±0.2 

±0.3 

±0.6 

hailfinder 

0.85 

0.85 

1.40 

1.69 

1.83 

2.06 

2.13 

2.12 

1.95 

1.96 

±0.1 

±0.1 

±0.1 

±0.1 

±0.1 

±0.1 

±0.1 

±0.2 

±0.2 

±0.6 

muninl 

0.77 

0.85 

0.93 

1.35 

2.11 

4.30 

12.92 

23.32 

24.95 

24.76 

±0.0 

±0.0 

±0.0 

±0.0 

±0.0 

±0.2 

±0.7 

±2.6 

±5.1 

±6.7 

pigs 

0.80 

0.80 

4.55 

4.71 

5.00 

5.62 

7.63 

11.10 

14.02 

11.74 

±0.0 

±0.0 

±0.1 

±0.1 

±0.2 

±0.2 

±0.3 

±0.6 

±1.7 

±3.2 

link 

1.16 

1.93 

2.76 

5.55 

7.04 

8.19 

10.00 

13.87 

15.32 

24.74 

±0.0 

±0.0 

±0.0 

±0.1 

±0.2 

±0.2 

±0.3 

±0.4 

±2.5 

±4.2 












all 

0.89 

1.04 

1.86 

2.70 

3.11 

3.71 

5.39 

7.75 

8.44 

9.64 

±0.1 

±0.1 

±1.2 

±1.5 

±1.9 

±2.2 

±4.0 

±7.3 

±8.4 

±9.7 


5.2. Label powerset decomposition 

We shall first introduce the concept of label powerset 
that will play a pivotal role in the factorization of the 
conditional distribution p{Y \ X). 

Definition 1. Y^p c Y is called a label powerset ijf 
Yz,p X Y \ Yz,p I X. Additionally, Ypp is said minimal 
if it is non empty and has no label powerset as proper 
subset. 

Let LP denote the set of all powersets defined over 
U, and m/nLP the set of all minimal label powersets. 
It is easily shown {Y, 0} c LP. The key idea behind 
label powersets is the decomposition of the conditional 
distribution of the labels into a product of factors 


2012h . Therefore, we seek a fac¬ 
torization into a product of minimal factors in order to 
facilitate the estimation of the mode of the conditional 
distribution (also called the most probable explanation 
(MPE)): 


(PembczvAski et ^ 


L L 

max/7(y | x) = fl maxp(yz.p | x) = Ff max/7(yLP | mpp ) 
y If ytP: i. f yiP: 

The next section aims to obtain theoretical results for 
the characterization of the minimal label powersets Ypp. 
and their Markov boundaries from a DAG, in or¬ 

der to be able to estimate the MPE more effectively. 


L L 

p{Y I X) = [] p{Ypp^ I X) = n I 

AI 

where {Y/,P|,..., Y/.^^} is a partition of label power- 
sets and Mpp, is a Markov blanket of Ypp. in X. From 
the above definition, we have Ypp. X Ypp | X, Vi it j. 
In the framework of MLC, one can consider a mul¬ 
titude of loss functions. In this study, we focus on 
a non label-wise decomposable loss function called 
the subset 0/1 loss which generalizes the well-known 
0/1 loss from the conventional to the multi-label set¬ 
ting. The risk-minimizing prediction for subset 0/1 
loss is simply given by the mode of the distribution 


5.3. Label powerset characterization 

In this section, we show that minimal label powersets 
can be depicted graphically when p is faithful to a DAG, 

Theorem 6. Suppose p is faithful to a DAG Q. Then, 
Yi and Yj belong to the same minimal label powerset if 
and only if there exists an undirected path in Q between 
nodes Yi and Yj in Y such that all intermediate nodes Z 
are either (i) Z eY, or (ii) Z e'X. and Z has two parents 
in Y (i.e. a collider of the form Yp ^ X <— Yq). 

The proof is given in the Appendix. We shall now 
address the following questions: What is the Markov 
boundary in X of a minimal label powerset? Answering 
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this question for general distributions is not trivial. We 
establish a useful relation between a label powerset and 
its Markov boundary in X when p is faithful to a DAG, 

Theorem 7. Suppose p is faithful to a DAG Q. Let 
'^LP — {Yi,Y 2 , ■ ■ ■ ,Yn} be a label powerset. Then, its 
Markov boundary M in U is also its Markov boundary 
in X, and is given in ^byM — Uj=ifPCFj U SPy^.) \ Y. 

The proof is given in the Appendix. 

5.4. Experimental setup 

The MLC problem is decomposed into a series of 
multi-class classification problems, where each multi¬ 
class variable encodes one label powerset, with as many 
classes as the number of possible combinations of la¬ 
bels, or those present in the training data. At this point, 
it should be noted that the LP method is a special case of 
this framework since the whole label set is a particular 
label powerset (not necessarily minimal though). The 
above procedure can been summarized as follows: 

1. Learn the BN local graph Q around the label set; 

2. Read off Q the minimal label powersets and their 
respective Markov boundaries; 

3. Train an independent multi-class classifier on each 
minimal LP, with the input space restricted to its 
Markov boundary in 

4. Aggregate the prediction of each classifier 
to output the most probable explanation, i.e. 
argmaXj,/7(y | x). 

To assess separately the quality of the minimal label 
powerset decomposition and the feature subset selection 
with Markov boundaries, we investigate 4 scenarios: 

• BR without feature selection (denoted BR): a clas¬ 
sifier is trained on each single label with all fea¬ 
tures as input. This is the simplest approach as it 
does not exploit any label dependency. It serves as 
a baseline learner for comparison purposes. 

• BR with feature selection (denoted BR-l-MB): a 
classifier is trained on each single label with the 
input space restricted to its Markov boundary in 
0. Compared to the previous strategy, we evaluate 
here the effectiveness of the feature selection task. 

• Minimum label powerset method without feature 
selection (denoted MLP): the minimal label pow¬ 
ersets are obtained from the DAG. All features are 
used as inputs. 


• Minimum label powerset method with feature se¬ 
lection (denoted MLP-l-MB): the minimal label 
powersets are obtained from the DAG. the input 
space is restricted to the Markov boundary of the 
labels in that powerset. 


We use the same base learner in each meta-learning 
method: the w ell-known Random Forest classifier 
dBreimanl 1200 ih . RF achieves good performance in 
stand ard classification as well as in mu l ti-labe l prob¬ 
lems ( Kocev et al. . 2007 : Madiarov et ak . 2012 ). and is 
able to handle both continuous and discrete data eas¬ 
ily, which is mu ch appreciated. The sta ndard RF imple¬ 
mentation in R ( Liaw & Wiener , 2002 3 was used. For 
practical purposes, we restricted the forest size of RF to 
100 trees, and left the other parameters to their default 
values. 


5.4.1. Data and performance indicators 

A total of 10 multi-label data sets are collected for ex¬ 
periments in this section, whose characteristics are sum¬ 
marized in Table [3] These data sets come from differ¬ 
ent problem domains including texk_ biology, and mu¬ 
sic. They can be found on the Mulai 
for im age which comes from Zhoi 


□ r epository, except 
I 3 ( Maron & Ratan . 
1998h . Let D be the multi-label data set, we use 


\0\, dim{D), L{!D), FfD) to represent the number of ex¬ 
amples, number of features, number of possible labels, 
and feature type respectively. DL(D) - |{T|3ji: : (x, Y) e 
D}\ counts the number of distinct label combinations ap¬ 
pearing in the data set. Continues values are binarized 
during the BN structure learning phase. 


Table 3: Data sets characteristics 


data set 

domain 

m 

dim{T>) 

F(V) 

L(D) 

DLCD) 

emotions 

music 

593 

72 

cont. 

6 

27 

yeast 

biology 

2417 

103 

cont. 

14 

198 

image 

images 

2000 

135 

cont. 

5 

20 

scene 

images 

2407 

294 

cont. 

6 

15 

slashdot 

text 

3782 

1079 

disc. 

22 

156 

genbase 

biology 

662 

1186 

disc. 

27 

32 

medical 

text 

978 

1449 

disc. 

45 

94 

enron 

text 

1702 

1001 

disc. 

53 

753 

bibtex 

text 

7395 

1836 

disc. 

159 

2856 

corelSk 

images 

5000 

499 

disc. 

374 

3175 


The performance of a multi-label classifier 
can be assessed by several evaluation measures 
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^ http://crcui.r-project.org/web/packages/randomForest 
"http://mulan.sourceforge.net/datasets.html 
^ http://lamda.nju.edu.cn/data_MIMLimage.ashx 






















(iTsoumakas et al.Ll2010al) . We focus on maximizing a 
non-decomposable score function: the global accuracy 
(also termed subset accuracy, complement of the 
0/1-loss), which measures the correct classification 
rate of the whole label set (exact match of all labels 
required). Note that the global accuracy implicitly 
takes into account the label correlations. It is therefore 
a very strict evaluation measure as it requires an exact 
match of the predicted an d the true set of labels. I t 
was recently proved in (IDembczvAski et all 12012h 
that BR is optimal for decomposable loss functions 
(e.g., the hamming loss), while non-decomposable 
loss functions (e.g. subset loss) inherently require the 
knowledge of the label conditional distribution. 10-fold 
cross-validation was performed for the evaluation of 
the MLC methods. 


5.4.2. Results 

Table |4]reports the outputs of H2PC and MMHC, Ta¬ 
ble |5] shows the global accuracy of each method on the 
10 data sets. Table |6] reports the running time and the 
average node degree of the labels in the DAGs obtained 
with both methods. Figures |3] up to [T] display graph¬ 
ically the local DAG structures around the labels, ob¬ 
tained with H2PC and MMHC, for illustration purposes. 

Several conclusions may be drawn from these exper¬ 
iments. First, we may observe by inspection of the av¬ 
erage degree of the label nodes in Table that several 
DAGs are densely connected, like scene or bibtex, while 
others are rather sparse, like genbase, medical, corelSk. 
The DAGs displayed in Figures [3 up to [T] lend them¬ 
selves to interpretation. They can be used for encod¬ 
ing as well as portraying the conditional independen¬ 
cies, and the d-sep criterion can be used to read them 
off the graph. Many label powersets that are reported 
in Table |4] can be identified by graphical inspection of 
the DAGs. For instance, the two label powersets in 
yeast (Figure [2 bottom plot) are clearly noticeable in 
both DAGs. Clearly, BNs have a number of advantages 
over alternative methods. They lay bare useful infor¬ 
mation about the label dependencies which is crucial if 
one is interested in gaining an understanding of the un¬ 
derlying domain. It is however well beyond the scope 
of this paper to delve deeper into the DAG interpreta¬ 
tion. Overall, it appears that the structures recovered 
by H2PC are significantly denser, compared to MMHC. 
On bibtex and enron, the increase of the average label 
degree is the most spectacular. On bibtex (resp. enron), 
the average label degree has raised from 2.6 to 6.4 (resp. 
from 1.2 to 3.3). This result is in nice agreement with 
the experiments in the previous section, as H2PC was 
shown to consistently reduce the rate of false negative 


edges with respect to MMHC (at the cost of a slightly 
higher false discovery rate). 

Second, Table 0] is very instructive as it reveals that 
on emotions, image, and scene, the MLP approach boils 
down to the LP scheme. This is easily seen as there is 
only one minimal label powerset extracted on average. 
In contrast, on genbase the MLP approach boils down 
to the simple BR scheme. This is confirmed by inspect¬ 
ing Table |5] the performances of MMHC and H2PC 
(without feature selection) are equal. An interesting ob¬ 
servation upon looking at the distribution of the label 
powerset size shows that for the remaining data sets, the 
MLP mostly decomposes the label set in two parts: one 
on which it performs BR and the other one on which 
it performs LP. Take for instance enron, it can be seen 
from Table 0] that there are approximately 23 label sin¬ 
gletons and a powerset of 30 labels with H2PC for a to¬ 
tal of 53 labels. The gain in performance with MLP over 
BR, our baseline learner, can be ascribed to the quality 
of the label powerset decomposition as BR is ignoring 
label dependencies. As expected, the results using MLP 
clearly dominate those obtained using BR on all data 
sets except genbase. The DAGs display the minimum 
label powersets and their relevant features. 

Third, H2PC compares favorably to MMHC. On 
scene for instance, the accuracy of MLP-i-MB has raised 
from 20% (with MMHC) to 56% (with H2PC). On 
yeast, it has raised from 7% to 23%. Without feature 
selection, the difference in global accuracy is less pro¬ 
nounced but still in favor of H2PC. A Wilcoxon signed 
rank paired test reveals statistically significant improve¬ 
ments of H2PC over MMHC in the MLP approach with¬ 
out feature selection (p < 0.02). This trend is more pro¬ 
nounced when the feature selection is used (p < 0.001), 
using MLP-MB. Note however that the LP decompo¬ 
sition for H2PC and MMHC on yeast and image are 
strictly identical, hence the same accuracy values in Ta- 
ble|5]in the MLP column. 

Fourth, regarding the utility of the feature selection, 
it is difficult to reach any conclusion. Whether BR or 
MLP is used, the use of the selected features as inputs 
to the classification model is not shown greatly benefi¬ 
cial in terms of global accuracy on average. The per¬ 
formance of BR and our MLP with all the features out¬ 
performs that with the selected features in 6 data sets 
but the feature selection leads to actual improvements 
in 3 data sets. The difference in accuracy with and with¬ 
out the feature selection was not shown to be statisti¬ 
cally significant (p > 0.20 with a Wilcoxon signed rank 
paired test). Surprisingly, the feature selection did ex¬ 
ceptionally well on genbase. On this data set, the in¬ 
crease in accuracy is the most impressive: it raised from 
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7% to 98% which is atypical. The dramatic increase in 
accuracy on genbase is due solely to the restricted fea¬ 
ture set as input to the classification model. This is also 
observed on medical, to a lesser extent though. Inter¬ 
estingly, on large and densely connected networks (e.g. 
bibtex, slashdot and corelSK), the feature selection per¬ 
formed very well in terms of global accuracy and signif¬ 
icantly reduced the input space which is noticeable. On 
emotions, yeast, image, scene and genbase, the method 
reduced the feature set down to nearly 1/100 its origi¬ 
nal size. The feature selection should be evaluated in 
view of its effectiveness at balancing the increasing er¬ 
ror and the decreasing computational burden by drasti¬ 
cally reducing the feature space. We should also keep 
in mind that the feature relevance cannot be defined in¬ 
dependently of the learner and the model-performance 
metric (e.g., the loss function used). Admittedly, our 
feature selection based on the Markov boundaries is not 
necessarily optimal for the base MLC learner used here, 
namely the Random Forest model. 

As far as the overall running time performance is con¬ 
cerned, we see from Table that for both methods, the 
running time grows somewhat exponentially with the 
size of the Markov boundary and the number of fea¬ 
tures, hence the considerable rise in total running time 
on bibtex. H2PC takes almost 200 times longer on bib¬ 
tex (1826 variables) and enron (1001 variables) which 
is quite considerable but still affordable (44 hours of 
single-CPU time on bibtex and 13 hours on enron). We 
also observe that the size of the parent set with H2PC 
is 2.5 (on bibtex) and 3.6 (on enron) times larger than 
that of MMHC (which may hurt interpretability). In 
fact, the running time is known to increase exponen¬ 
tially with the parent set size of the true underlying net¬ 
work. This is mainly due the computational overhead 
of greedy search-and-score procedure with larger par¬ 
ent sets which is the most promising part to optimize in 
terms of computational gains as we discuss in the Con¬ 
clusion. 

6. Discussion & practical applications 

Our prime conclusion is that H2PC is a promising 
approach to constructing BN global or local structures 
around specific nodes of interest, with potentially thou¬ 
sands of variables. Concentrating on higher recall val¬ 
ues while keeping the false positive rate as low as pos¬ 
sible pays off in terms of goodness of fit and struc¬ 
ture accuracy. Historically, the main practical diffi¬ 
culty in the application of BN structure discovery ap¬ 
proaches has been a lack of suitable computing re¬ 
sources and relevant accessible software. The number 


Table 4: Distribution of the number and the size of the minimal label 
powersets output by H2PC (top) and MMHC (bottom). On each data 
set: mean number of powersets, minimum/median/maximum number 
of labels per powerset, and minimum/median/maximum number of 
distinct classes per powerset. The total number of labels and distinct 
labels combinations is recalled for convenience. 


H2PC 


dataset 

#LPs 

# labels / LP 
min/med/max 

L{V) 

# classes / LP 
min/med/max 

DL{V) 

emotions 

1.0 ±0.0 

6/6/6 

(6) 

26 / 27 / 27 

(27) 

yeast 

2.0 ± 0.0 

2/7/12 

(14) 

3/64/135 

(198) 

image 

1.0 ±0.0 

5/5/5 

(5) 

19/20/20 

(20) 

scene 

1.0 ±0.0 

6/6/6 

(6) 

14/ 15/15 

(15) 

slashdot 

9.5 ±0.8 

1/1/14 

(22) 

1/2/111 

(156) 

genbase 

26.9 ± 0.3 

1/1/2 

(27) 

1/2/2 

(32) 

medical 

38.6 ± 1.8 

1/1/6 

(45) 

1/2/10 

(94) 

enron 

24.5 ± 1.9 

1/1/30 

(53) 

1/2/334 

(753) 

bibtex 

39.6 ±4.3 

1/1/112 

(159) 

2/2/1588 

(2856) 

corel5k 

97.7 ± 3.9 

1/1/263 

(374) 

1/2/2749 

(3175) 


MMHC 

dataset 

#LPs 

# labels / LP 
min/med/max 

L(D) 

# classes / LP 
min/med/max 

DL{V) 

emotions 

1.2 ±0.4 

1/6/6 

(6) 

2/26/27 

(27) 

yeast 

2.0 ± 0.0 

1/7/13 

(14) 

2/89/186 

(198) 

image 

1.0 ±0.0 

5/5/5 

(5) 

19/20/20 

(20) 

scene 

1.0 ±0.0 

6/6/6 

(6) 

14/ 15/15 

(15) 

slashdot 

15.1 ±0.6 

1/1/9 

(22) 

1/2/46 

(156) 

genbase 

27.0 ± 0.0 

1/1/1 

(27) 

1/2/2 

(32) 

medical 

41.7 ± 1.4 

1/1/4 

(45) 

1/2/7 

(94) 

enron 

36.8 ±2.7 

1/1/12 

(53) 

1/2/119 

(753) 

bibtex 

77.2 ±3.1 

1/1/28 

(159) 

2/2/233 

(2856) 

corel5k 

165.3 ±5.5 

1/1/177 

(374) 

1/2/2294 

(3175) 


Table 5: Global classification accuracies using 4 learning methods 
(BR, MLP, BR+MB, MLP+MB) based on the DAG obtained with 
H2PC and MMHC. Best values between H2PC and MMHC are bold¬ 
faced. 


MLP BR+MB MLP+MB 

MMHC H2PC MMHC H2PC MMHC H2PC 


emotions 

0.327 

0.371 

0.391 

0.147 

0.266 

0.189 

0.285 

yeast 

0.169 

0.273 

0.271 

0.062 

0.155 

0.073 

0.233 

image 

0.342 

0.504 

0.504 

0.227 

0.252 

0.308 

0.360 

scene 

0.580 

0.733 

0.733 

0.123 

0.361 

0.199 

0.565 

slashdot 

0.355 

0.419 

0.474 

0.274 

0.373 

0.278 

0.439 

genbase 

0.069 

0.069 

0.069 

0.974 

0.976 

0.974 

0.976 

medical 

0.454 

0.499 

0.502 

0.630 

0.651 

0.636 

0.669 

enron 

0.134 

0.165 

0.180 

0.024 

0.079 

0.079 

0.157 

bibtex 

0.108 

0.115 

0.167 

0.120 

0.133 

0.121 

0.177 

corel5k 

0.002 

0.030 

0.043 

0.000 

0.002 

0.010 

0.034 
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Table 6: DAG learning time (in seconds) and average degree of the 
label nodes. 


dataset 

time 

MMHC 
label degree 

H2PC 

time label degree 

emotions 

1.5 

2.6 ± 1.1 

16.2 

4.7 ± 1.8 

yeast 

9.0 

3.0 ± 1.6 

90.9 

5.0 ±2.9 

image 

3.1 

4.0 ±0.8 

28.4 

4.6 ± 1.0 

scene 

9.9 

3.1 ± 1.0 

662.9 

6.6 ± 1.8 

slashdot 

44.8 

2.7 ± 1.4 

822.3 

4.0 ±5.2 

genbase 

12.5 

1.0 ±0.3 

12.4 

1.1 ±0.3 

medical 

43.8 

1.7± 1.1 

334.8 

2.9 ± 2.4 

enron 

199.8 

1.2 ± 1.5 

47863.0 

4.3 ±5.2 

bibtex 

960.8 

2.6 ± 1.3 

159469.6 

6.4 ± 10.3 

corel5k 

169.6 

1.5 ± 1.4 

2513.0 

2.9 ±2.8 


of variables which can be included in exact BN anal¬ 
ysis is still limited. As a guide, this might be less than 
about 40 variables for exact structural sea rch techniques 
( Perrier et al. . 2008 : Koiima et al. . 2010t) . In contrast, 
constraint-based heuristics like the one presented in this 
study is capable of processing many thousands of fea¬ 
tures within hours on a personal computer, while main¬ 
taining a very high structural accuracy. H2PC and 
MMHC could potentially handle up to 10,000 labels in 
a few days of single-CPU time and far less by paralleliz- 
ing the skeleton identifica t ion algorithm as discussed i n 


( Tsamardinos et all 2006 : Villanueva & Maciel . 20141) . 

The advantages in terms of structure accuracy and 
its ability to scale to thousands of variables opens up 
many avenues of future possible applications of H2PC 
in various domains as we shall discuss next. For ex¬ 
ample, BNs have especially proven to be useful ab 


stract i ons in computationa l biology (iNagaraian et al 
2013 : Scutari & Nagaraian . 2013 : Prestat et al.L 2013 


Aussem et al.. 20121 201(1 Pena , 2008 : Pena et al 


20051) . Identifying the gene network is crucial for un¬ 
derstanding the behavior of the cell which, in turn, can 
lead to better diagnosis and treatment of diseases. This 
is also of great importance for characterizing the func¬ 
tion of genes and the proteins they encode in determin¬ 
ing traits, psychology, or development of an organism. 
Genome sequencing uses high-throughput techniques 
like DNA microarrays, proteomics, metabolomics and 
mutation analysis to describe the f unction and in 


2006). Learning BN models of gene network 
these huge data is still a difficult task (Badea 

from 

2004; 

Bernard & Hartemink, 20051 ' 

"riedman et al.L 

1999a; 

Ott et al.. 20041 Peer et all 200 

1). In these studies, the 


authors had decide in advance which genes were in¬ 
cluded in the learning process, in all the cases les s than 
1000, and which genes were excluded from it ( Penal 


20081) . H2PC overcome the problem by focusing the 


search around a targeted gene: the k ey step is the iden - 
tification of the vicinity of a node X ( Pena et aU 2005). 

Our second objective in this study was to demonstrate 
the potential utility of hybrid BN structure discovery to 
multi-label learning. In multi-label data where many 
inter-dependencies between the labels may be present, 
explicitly modeling all relationships between the labels 
is intuitively far more reasonable (as demonstrated in 
our experiments). BNs explicitly account for such inter¬ 
dependencies and the DAG allows us to identify an op¬ 
timal set of predictors for every label powerset. The ex¬ 
periments presented here support the conclusion that lo¬ 
cal structural learning in the form of local neighborhood 
induction and Markov blanket is a theoretically well- 
motivated approach that can serve as a powerful learn¬ 
ing framework for label dependency analysis geared 
toward multi-label learning. Multi-label scenarios are 
found in ma ny application doma i ns, such as mu lt imedi a 
annotation ( Snoek et al. . 2006t Trohidis et al. . 20 0^. 


19991 Zhang & Zhou. 

20061). 

protein function classifi- 
, and antiretroviral drug 

cation ( 

Roth & Fischei 

, 20071 

categorization ( 

Borchani et al. 

, 2Q1?). 


7. Conclusion & avenues for future research 


We first discussed a hybrid algorithm for global 
or local (around target nodes) BN structure learn¬ 
ing called Hybrid HPC (H2PC). Our extensive 
experiments showed that H2PC outperforms the 
state-of-the-art MMHC by a significant margin in 
terms of edge recall without sacrificing the number 
of extra edges, which is crucial for the sound¬ 
ness of the super-structure used during the second 
stage of hybrid methods, like the one s proposed in 
( Perrier et al. . 2008 : Koiima et al. . 2010l) . The code of 
H2PC is open-source and publicly available online at 
https://github.com/madbix/bnlearn-clone-3.4, 
Second, we discussed an application of H2PC to the 
multi-label learning problem which is a challenging 
problem in many real-world application domains. We 
established theoretical results, under the faithfulness 
condition, in order to characterize graphically the 
so-called minimal label powersets that appear as 
irreducible factors in the joint distribution and their 
respective Markov boundaries. As far as we know, this 
is the first investigation of Markov boundary principles 
to the optimal variable/feature selection problem in 
multi-label learning. These formal results offer a simple 
guideline to characterize graphically : i) the minimal 
label powerset decomposition, (i.e. into minimal 
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subsets Yip Q Y such that Ypp i Y \ Yip \ X), and 
ii) the minimal subset of features, w.r.t an Information 
Theory criterion, needed to predict each label powerset, 
thereby reducing the input space and the computational 
burden of the multi-label classification. The theoretical 
analysis laid the foundation for a practical multi-label 
classification procedure. Another set of experiments 
were carried out on a broad range of multi-label data 
sets from different problem domains to demonstrate its 
effectiveness. H2PC was shown to outperform MMHC 
by a significant margin in terms of global accuracy. 


We suggest several avenues for future research. As 
far as BN structure learning is concerned, future work 
will aim at: 1 ) ascertaining which independence test 
(e.g. tests targeting specific distributions, employing 
parametri c assumptions etc.) is most suited to the data 
at hand dTsamardinos & Borboudal^ 2010l: Scutaril 


201 11 ) : 2 ) controlling the false dis covery rate of the 


ry rate c 

3 l20n8h 


edges in the graph output by H2PC (iPenaL IZtUkll) espe¬ 
cially when dealing with more nodes than samples, e.g. 
learning gene networks from gene expression data. In 
this study, H2PC was run independently on each node 
without keeping track of the dependencies found previ¬ 
ously. This lead to some loss of efficiency due to re¬ 
dundant calculations. The optimization of the H2PC 
code is currently being undertaken to lower the compu¬ 
tational cost while maintaining its performance. These 
optimizations will include the use of a cache to store 
the (in)dependencies and the use of a global structure. 
Other interesting research avenues to explore are exten¬ 
sions and modifications to the greedy search-and-score 
procedure which is the most promising part to optimize 
in terms of computational gains. Regarding the multi¬ 
label learning problem, experiments with several thou¬ 
sands of labels are currently been conducted and will 
be reported in due course. We also intend to work on 
relaxing the faithfulness assumption and derive a prac¬ 
tical multi-label classification algorithm based on H2PC 
that is correct under milder assumptions underlying the 
joint distribution (e.g. Composition, Intersection). This 
needs further substantiation through more analysis. 
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Appendix 

Lemmn 8. ^Ypp.^ Ypp. £ LP, than Ypp. n Ypp. £ LP. 

Proof. To keep the notation uncluttered and for the sake 
of simplicity, consider a partition {Yi,Y 2 ,Y 3 ,Y 4 } ofY 
such that Ypp. = Yi U Y 2 , Ypp. = Y 2 U Y 3 . From 
the label powerset assumption for Ypp. and Ypp., we 
have (Yi U Y 2 ) X (Y 3 U Y 4 ) | X and (Y 2 U Y 3 ) X 
(Yi U Y 4 ) I X . From the Decomposition, we have that 
Y 2 X (Y 3 U Y 4 ) I X . Using the Weak union, we obtain 
that Y 2 X Yi I (Y 3 U Y 4 U X) . From these two facts, 
we can use the Contraction property to show that Y 2 X 
(Yi U Y 3 U Y 4 ) I X . Therefore, Y 2 = Ypp^ n Ypp. is a 
label powerset by definition. 

Lemma 9. Let Y, and Yj denote two distinct labels in 
Y and define by Ypp. andYpp. their respective minimal 
label powerset. Then we have, 

3Z c Y \ {Yi, Yj], {Yi) A {Yj} | (X U Z) ^ Y^^, = Y^^, 

Proof. Let us suppose Ypp, + Ypp By the label pow¬ 
erset definition for Ypp,, we have Ypp, X Y \ Ypp, \ X. 
As Ypp, r\ Ypp. = 0 owing to Lemma [ 8 ] we have 
that Yj i Ypp,. VZ £ Y \ {Yi,Yj}, Z can be decom¬ 
posed as Z; U Zj such that Z; = Z n (Ypp^ \ {F,}) and 
Zj - Z \ Z, . Using the Decomposition property, we 
have that ({F,) U Z,) X ({Fj} U Zj) \ X . Using the Weak 
union property, we have that {F,} X {Yj} \ (X U Z) . As 
this is true VZ £ Y \ {F,-, Yj}, then i^Z £ Y \ {F,-, Yj}, (F,} / 
X (Fj) I (X U Z), which completes the proof. 


Lemma 10. Consider Ypp a minimal label powerset. 
Then, if p satisfies the Composition property, Z JL Ypp\ 

Z I X. 

Proof. By contradiction, suppose a nonempty Z exists, 
such that Z JL Ypp \ Z \ X . From the label powerset 
assumption of Y^p, we have that Ypp X Y \ Ypp \ X 
. From these two facts, we can use the Composition 
property to show that Z X Y \ Z | X which contradicts 
the minimal label powerset assumption of Ypp. This 
concludes the proof. 
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Theorem 7. Suppose p is faithful to a DAG Q. Then, 
Yi and Yj belong to the same minimal label powerset if 
and only if there exists an undirected path in Q between 
nodes Yi and Yj in Y such that all intermediate nodes Z 
are either (i) Z £ Y, or (ii)Z £ X and Z has two parents 
in Y (i.e. a collider of the form Yp ^ X ^ Yq). 
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Proof. Suppose such a path exists. By conditioning on 
all the intermediate colliders in Y of the form Yp —> 
Yk Yq along an undirected path in Q between nodes 
Yi and Yj, we ensure that 3Z c Y \ {F,, Yj), dSep(F/, Yj \ 
X U Z). Due to the faithfulness, this is equivalent to 
{Yi} {Fj} I (X U Z). From Lemma|3 we conclude that 
F, and Yj belong to the same minimal label powerset. 
To show the inverse, note that owing to Lemma [10] we 
know that there exists no partition {Z,-, Z^j of Z such that 
({F,} U Z,) i ({Fj} U Zj) I X. Due to the faithfulness, 
there exists at least a link between ({F,} U Z,) and ({Fy} U 
Zj) in the DAG, hence by recursion, there exists a path 
between F, and Yj such that all intermediate nodes Z are 
either (i) Z e Y, or (ii) Z e X and Z has two parents in 
Y (i.e. a collider of the form Yp X Yq). 


Theorem 8. Suppose p is faithful to a DAG Let 
'Ylp — {Yi,Y 2, . ■. ,Yn] be a label powerset. Then, its 
Markov boundary M in U is also its Markov boundary 
in X, and is given in Q by^ — U SPy^) \ Y. 

Proof. First, we prove that M is a Markov boundary 
of Yz,p in U. Define M, the Markov boundary of F, in 
U, and M', = M,- \ Y. From Theorem [5] M,- is given 
in g by M,- = PCy^ U SPy,. We may now prove that 
M'l U- ■ -UM'n is a Markov boundary of {Fi, F2,..., F„} 
in U. We show first that the statement holds for n - 2 
and then conclude that it holds for all n by induction. 
Let W denote U \ (Yi U Y2 U M'l U M'2), and define 
Yi = {Fi} and Y2 = {F2}. From the Markov blanket 
assumption for Yi we have Yi i U \ (Yi U Mj) | Mi 
. Using the Weak Union property, we obtain that Y1 i 
W I M'l U M'2 U Y2 . Similarly we can derive Y2 X 
W I M'2 U M'l U Yi . Combining these two statements 
yields Yi U Y2 i W | M'l UM' 2 due to the Intersection 
property. Let M = M' 1 U M'2 the last expression can be 
formulated asYi UY2 X U\(Yi UY2 U M) | M which 
is the definition of a Markov blanket of Yi U Y2 in U. 
We shall now prove that M is minimal. Let us suppose 
that it is not the case, i.e., 3Z c M such that Yi U Y2 X 
ZU(U\(YiUY 2UM)) |M\Z. DefineZi =ZnMi and 
Z2 = ZnM2, we may apply the Weak Union property to 
getYi X Zi I (M\Z)UY2UZ2 U(U\(YuM)) which can 
be rewritten more compactly as Yi X Zi | (Mi \ Zi) U 
(U\(Yi U Ml)). From the Markov blanket assumption, 
we have Yi X U \ (Yi U Mi) | Mi . We may now 
apply the Intersection property on these two statements 
to obtain Yi X ZiU(U\(YiUMi)) | Mi\Zi . Similarly, 
we can derive Y2 X Z2 U (U \ (Y2 U M2)) | M2 \ Z2 . 
From the Markov boundary assumption of Mi and M2, 
we have necessarily Zi = 0 and Z2 = 0, which in turn 


yields Z = 0. To conclude for any n > 2, it suffices to set 
YI = U”=i {U/! and Y 2 = {F„} to conclude by induction. 

Second, we prove that M is a Markov blanket of Y/.^ 
in X. Define Z = M n Y. From the label powerset 
definition, we have \ip X Y \Yip \ X . Using the 
Weak Union property, we obtain Y 2 ,f X Z | U\(Y/,pUZ) 
which can be reformulated as Ypp X Z | (U \ (Ypp U 
M)) U (M \ Z). Now, the Markov blanket assumption for 
Ypp in U yields Ypp X U \ (Ypp UM) | M which can be 
rewritten as Ypp X U \ (Ypp U M) | (M \ Z) U Z . From 
the Intersection property, we get Ypp X Z U (U \ (Ypp U 
M)) I M \ Z . From the Markov boundary assumption 
of M in U, we know that there exists no proper subset 
of M which satisfies this statement, and therefore Z = 
M n Y = 0. From the Markov blanket assumption of M 
in U, we have Ypp X U \ (Ypp U M) | M . Using the 
Decomposition property, we obtain Ypp X X \ M | M 
which, together with the assumption M n Y = 0, is the 
definition of a Markov blanket of Ypp in X. 

Finally, we prove that M is a Markov boundary of 
Ypp in X. Let us suppose that it is not the case, i.e., 
3Z c M such that Ypp X Z U (X \ M) | M \ Z 
. From the label powerset assumption of Ypp, we 
have Ypp X Y \ Ypp | X which can be rewritten as 
Ypp X Y \ Ypp I (X \ M) U (M \ Z) U Z . Due to the 
Contraction property, combining these two statements 
yields Ypp X Z U (U \ (M U Ypp) | M \ Z . From 
the Markov boundary assumption of M in U, we have 
necessarily Z - which suffices to prove that M is a 
Markov boundary of Ypp in X. 
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Figure 3: The local BN structures learned by MMHC (left plot) and H2PC (right plot) on a single cross-validation split, on Emotions and Yeast. 
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(b) Scene 

Figure 4: The local BN structures learned by MMHC (left plot) and H2PC (right plot) on a single cross-validation split, on Image and Scene. 
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(a) Slashdot 



(b) Genbase 

Figure 5: The local BN structures learned by MMHC (left plot) and H2PC (right plot) on a single cross-validation split, on Slashdot and Genbase. 
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(b) Corel5k 



Figure 7: The local BN structures learned by MMHC (left plot) and H2PC (right plot) on a single cross-validation split, on Bibtex and Corel5k. 
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